Data Expo 2008 «Airline on-time performance»

by Mostafa Abobakr

Preliminary Wrangling

This dataset reports flights in the United States during 2008, including carriers, arrival and departure delays, and the reasons for delays.

Motivation: We will use the dataset to find insights that could help reduce flight delays, and to reach evidence-backed findings about which carriers have the fewest delays.

I moved to DB Browser for SQLite to work more quickly with this large dataset of over 7 million records, and to select only the columns needed for my investigation, using the following query:

SELECT FlightNum, TailNum, Month, DayofMonth, DayOfWeek,
       c.Description AS Carrier,
       ArrDelay, Cancelled, CancellationCode, Diverted,
       CarrierDelay, WeatherDelay, NASDelay, SecurityDelay,
       Origin, Dest, Distance, TaxiIn, TaxiOut
FROM "2008" AS flights
JOIN carriers AS c
  ON flights.UniqueCarrier = c.Code;

I reduced the dataset from 29 columns to 19, and I joined it with the carriers.csv data (a plain JOIN on the carrier code, i.e. an inner join) to get carrier names instead of their codes. I then exported the result to 2008_flights.csv and returned to the Jupyter notebook to complete the work.
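The same select-and-export step could also be run from the notebook itself. The following is only a minimal sketch, not the workflow actually used above (that was done in DB Browser for SQLite). It assumes a SQLite database with a table named "2008" and a table named carriers. The helper name `export_query`, the in-memory database, the two toy flight rows, and the carrier code `ZZ` are all hypothetical and exist only to make the example runnable.

```python
import csv
import io
import sqlite3

# The query from above, with the table name quoted as an identifier.
QUERY = """
SELECT FlightNum, TailNum, Month, DayofMonth, DayOfWeek,
       c.Description AS Carrier,
       ArrDelay, Cancelled, CancellationCode, Diverted,
       CarrierDelay, WeatherDelay, NASDelay, SecurityDelay,
       Origin, Dest, Distance, TaxiIn, TaxiOut
FROM "2008" AS flights
JOIN carriers AS c
  ON flights.UniqueCarrier = c.Code;
"""


def export_query(conn, query, out):
    """Run `query` and write a header plus all rows as CSV to `out`.

    Hypothetical helper; returns the number of data rows written.
    """
    cur = conn.execute(query)
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])
    count = 0
    for row in cur:
        writer.writerow(row)
        count += 1
    return count


# Toy stand-in for the real database (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "2008" ('
    "FlightNum, TailNum, Month, DayofMonth, DayOfWeek, UniqueCarrier, "
    "ArrDelay, Cancelled, CancellationCode, Diverted, "
    "CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, "
    "Origin, Dest, Distance, TaxiIn, TaxiOut)"
)
conn.execute("CREATE TABLE carriers (Code, Description)")

placeholders = ",".join("?" * 19)
conn.executemany(
    f'INSERT INTO "2008" VALUES ({placeholders})',
    [
        (335, "N712SW", 1, 3, 4, "WN", -14, 0, None, 0,
         None, None, None, None, "IAD", "TPA", 810, 4, 8),
        # "ZZ" is a made-up code with no entry in carriers.
        (999, "N000ZZ", 1, 3, 4, "ZZ", 5, 0, None, 0,
         None, None, None, None, "IAD", "TPA", 810, 4, 8),
    ],
)
conn.execute(
    "INSERT INTO carriers VALUES (?, ?)",
    ("WN", "Southwest Airlines Co."),
)

# Write to an in-memory buffer here; for a real export one would pass
# open("2008_flights.csv", "w", newline="") instead.
buffer = io.StringIO()
n_rows = export_query(conn, QUERY, buffer)
conn.close()
```

Because the query uses an inner join, the toy flight whose carrier code has no match in carriers is dropped from the output. If unmatched flights should be kept, with an empty carrier name, a LEFT JOIN would be the variant to try.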